prometheus logger: fix potential unlimited memory usage #529
I see some typographical errors in loggers.go - perhaps change "requeters" to "requesters" in the yaml? Also: what does "size" refer to? Bytes, megabytes, number of items...? I assume number of items, but perhaps you have not yet updated the config file example with a description. |
Thanks for catching the typo. This works well to limit memory usage, but it costs CPU. The size is the number of items in the cache.
requesters-cache-size: 50000
requesters-cache-ttl: 3600
domains-cache-size: 50000
domains-cache-ttl: 3600 |
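To make the meaning of "size" and "ttl" concrete, here is a minimal Go sketch of a cache bounded by item count, with a per-item time to live. It is not the code from this PR: the `LRU` type, its methods, and the use of seconds for the TTL are assumptions made for illustration only.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached key together with its expiry time (in seconds).
type entry struct {
	key     string
	expires int64
}

// LRU is a hypothetical cache bounded by number of items, not by bytes.
type LRU struct {
	size  int                      // maximum number of items
	ttl   int64                    // lifetime of an item, assumed to be in seconds
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> node in order
}

func NewLRU(size int, ttl int64) *LRU {
	return &LRU{size: size, ttl: ttl, order: list.New(), items: make(map[string]*list.Element)}
}

// Add inserts or refreshes key at time now, evicting the least
// recently used item when the cache is already full.
func (c *LRU) Add(key string, now int64) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).expires = now + c.ttl
		c.order.MoveToFront(el)
		return
	}
	if back := c.order.Back(); back != nil && c.order.Len() >= c.size {
		c.remove(back)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, expires: now + c.ttl})
}

// Len drops every expired item and returns the number of live ones.
func (c *LRU) Len(now int64) int {
	for el := c.order.Back(); el != nil; {
		prev := el.Prev() // capture before a possible removal
		if now >= el.Value.(*entry).expires {
			c.remove(el)
		}
		el = prev
	}
	return c.order.Len()
}

func (c *LRU) remove(el *list.Element) {
	delete(c.items, el.Value.(*entry).key)
	c.order.Remove(el)
}

func main() {
	cache := NewLRU(3, 3600)
	for i := 0; i < 10; i++ {
		cache.Add(fmt.Sprintf("domain%d.example", i), 0)
	}
	fmt.Println(cache.Len(0))    // 3: capped by item count
	fmt.Println(cache.Len(7200)) // 0: every item has outlived its TTL
}
```

The extra bookkeeping on every insert (map lookup, list move, possible eviction) is one plausible source of the CPU cost mentioned above.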
If this can get merged into the pipeline branch, I'll test ASAP and report comparative CPU usage. |
This patch has been merged in the pipeline branch |
Running - looks good so far, but will know shortly when we start to evict from the LRU for domains what that does to CPU. |
Testing questions: with the defaults, the number of domains stored should never be above 500000 - correct? (in your notes above, there is a typo of 50000) I am looking at dnscollector_total_domains_lru to measure this number. Currently, the value of that counter is 579000 so something is wrong. (branch: pipeline_mode, full clone half an hour ago, changes to all metrics with "_lru" are apparent so I know I'm running the right version.) |
Correct, the default value is 500k: https://github.com/dmachard/go-dnscollector/blob/c7f54ac9bf32e3778b3af5ba437ab3e7f91892d6/pkgconfig/loggers.go#L337 Unless you overrode the default value in the config file? |
I did not overwrite with the config file, so this was using defaults. However, the good news is that the number has been dropping but that may be due to timers and not to the maximum number (queries have been decreasing over the last few hours, so growth may be naturally diminishing.) This graph is plotting sum(dnscollector_total_domains_lru) for my system. It peaks well over 600k names, which is far above the 500k maximum. |
It may be worth noting that I have three feeds coming into this system from three different resolvers. Does this maximum value apply to the total number of names in memory, or is it per stream_id ? If it is the latter, then perhaps this is expected behavior. |
The maximum value is per stream_id
Okay, you have one dimension in 'dnscollector_total_domains_lru,' which is the stream_id. The maximum value in your case (with the sum) should be 1.5 million. Could you plot 'dnscollector_total_domains_lru' without the sum? |
It was almost supported, so I made a minor code adjustment to consolidate all streams into one for metric computations. If you want to test, you can append the following key to your Prometheus settings: prometheus-labels: ["stream_global"] With this modification, you should hit the 500k limit of the LRU cache (the stream_id label will be removed). Are memory and CPU usage OK? P.S.: if you want to know how many domains we have in total, don't forget to also count NXDomains (dnscollector_total_nxdomains_lru) and SERVFAIL (dnscollector_sfdomains_lru) |
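The difference between the per-stream limit and the consolidated mode can be sketched in Go as follows. This is an illustration under assumptions, not the project's code: `boundedSet`, `collector`, and the FIFO eviction are hypothetical simplifications, and the limit of 5 stands in for 500k.

```go
package main

import "fmt"

// boundedSet is a hypothetical stand-in for a size-limited cache.
// It evicts the oldest inserted key once the limit is reached.
type boundedSet struct {
	limit int
	keys  []string
	seen  map[string]bool
}

func (s *boundedSet) add(k string) {
	if s.seen[k] {
		return
	}
	if len(s.keys) > 0 && len(s.keys) >= s.limit {
		delete(s.seen, s.keys[0])
		s.keys = s.keys[1:]
	}
	s.keys = append(s.keys, k)
	s.seen[k] = true
}

// collector keeps one bounded cache per stream, or a single shared
// cache when global is true (the "stream_global" behaviour).
type collector struct {
	limit  int
	global bool
	caches map[string]*boundedSet
}

func newCollector(limit int, global bool) *collector {
	return &collector{limit: limit, global: global, caches: map[string]*boundedSet{}}
}

func (c *collector) observe(streamID, domain string) {
	if c.global {
		streamID = "global" // all streams share one cache
	}
	s, ok := c.caches[streamID]
	if !ok {
		s = &boundedSet{limit: c.limit, seen: map[string]bool{}}
		c.caches[streamID] = s
	}
	s.add(domain)
}

// total sums the cache sizes, like sum() over the stream_id label.
func (c *collector) total() int {
	n := 0
	for _, s := range c.caches {
		n += len(s.keys)
	}
	return n
}

// feed sends 8 distinct domains from each of three streams.
func feed(c *collector) {
	for _, stream := range []string{"resolver-1", "resolver-2", "resolver-3"} {
		for i := 0; i < 8; i++ {
			c.observe(stream, fmt.Sprintf("%s-domain%d.example", stream, i))
		}
	}
}

func main() {
	perStream := newCollector(5, false)
	feed(perStream)
	fmt.Println(perStream.total()) // 15: three streams, each capped at 5

	global := newCollector(5, true)
	feed(global)
	fmt.Println(global.total()) // 5: one shared cache capped at 5
}
```

With three streams and a per-stream limit, the summed metric can legitimately reach three times the configured size, which matches the 1.5 million ceiling mentioned earlier.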
CPU and memory numbers look fine - no significant changes from previous behaviors. I've had an instance running for two full days - no issues, and the memory usage is staying below the thresholds presented. I will re-start with a more aggressive threshold (lower) to see if that changes my CPU loading, but I think that is just an academic exercise at this point. Do the NXDOMAIN and SERVFAIL data also fall into the "dnscollector_total_domains_lru" number? |
See my previous #529 (comment): "if you want to know how many domains we have in total, don't forget to also count NXDomains (dnscollector_total_nxdomains_lru) and SERVFAIL (dnscollector_sfdomains_lru)". Thanks for the feedback, I will merge soon. |
Thank you for the comments, but I'm still not quite clear on the terminology. The term "dnscollector_total_domains_lru" would imply that it is the total of all possible subsets, regardless of rcode status. If it was only the "noerror" domains, then it would be expected that the metric would be "dnscollector_total_noerrordomains_lru".
It's fine that there is no single metric that shows the counter of all noerror, nxdomain, and servfail domains across all streams. If there are three metrics (dnscollector_total_nxdomains_lru, dnscollector_sfdomains_lru, and dnscollector_total_noerrordomains_lru) that have to be added, that is fine as long as they are unique counters of non-duplicated domains in each of those categories that look at all of the possible stream sets. In addition, having those counters (dnscollector_noerrordomains_lru, dnscollector_nxdomains_lru, dnscollector_sfdomains_lru) for each stream is useful. The sum of each of these rcode sets across streams will almost always be (confusingly) larger than the corresponding dnscollector_total_* values for each set, since I assume your code keeps each domain once, but tags it with which streams have seen the domain?
Also, the presence or absence of a "stream_id" tag would imply whether a metric is per-stream or not. If there was no "stream_id" tag, then I would assume it would be a de-duplicated counter of all possible domains of a particular rcode, across all streams.
Sorry to be so particular about the naming here, but it does make a significant difference in how numbers are interpreted, which then leads directly into the ability to manage the operation of the package in a meaningful way by a staff who may not be so specifically intimate with the details of the code and the subtle distinctions of metric naming.
Keeping Prometheus values straight is an important task in any large-scale operational considerations and I want to make sure this doesn't need to be re-done in a while after many people have already made assumptions about what the metrics mean. |
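The counting distinction raised here (sum of per-stream unique counts versus a de-duplicated count across all streams) can be shown with a small Go sketch. It only illustrates the arithmetic; the `observation` type and `countDomains` function are hypothetical and say nothing about how the collector stores domains.

```go
package main

import "fmt"

// observation is one domain seen on one stream.
type observation struct{ streamID, domain string }

// countDomains returns the sum of the per-stream unique domain counts
// and the number of unique domains across all streams combined.
func countDomains(obs []observation) (perStreamSum, global int) {
	perStream := map[string]map[string]struct{}{}
	all := map[string]struct{}{}
	for _, o := range obs {
		if perStream[o.streamID] == nil {
			perStream[o.streamID] = map[string]struct{}{}
		}
		perStream[o.streamID][o.domain] = struct{}{}
		all[o.domain] = struct{}{}
	}
	for _, set := range perStream {
		perStreamSum += len(set)
	}
	return perStreamSum, len(all)
}

func main() {
	obs := []observation{
		{"resolver-a", "example.com"},
		{"resolver-a", "example.org"},
		{"resolver-b", "example.org"}, // also seen on resolver-a
		{"resolver-b", "example.net"},
		{"resolver-b", "example.net"}, // repeat on the same stream
	}
	sum, global := countDomains(obs)
	fmt.Println(sum, global) // 4 3
}
```

A domain seen on several streams is counted once per stream in the first number and once overall in the second, so the per-stream sum is never smaller than the de-duplicated total.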
I prefer to remove any ambiguity in the metrics; here is my proposal:
Regarding memory usage, an LRU cache is associated with each metric, so each one should be individually configurable to enable or disable its computation. Duplicate entries can exist between the LRU caches for metrics n°2, n°3 and n°4 because, for example, a specific domain can be "NOERROR" at one time and "SERVFAIL" later. Keep in mind that these LRU caches are also used, and are mandatory, to compute "top domains/requesters" in real time with the following metrics
Regarding the |
Thanks for sharing that. Can you track this in a new issue? |
yep |
This PR tries to find a solution to limit memory usage with the Prometheus logger.
The following lists are stored in memory without any limitations:
The following metrics have been replaced